Server Trends

Frequency no longer increasing
• Logic speed scaled faster than memory bus
• (Processor clocks / Bus clock) consumes bandwidth

More speculation; attempts to prefetch
• Wrong guesses increase miss traffic

Shortening linesize limited by directory as cache size grows
• But doubling linesize doubles bus occupancy

Cores / die increasing each generation
• Multiplies off-chip bus transactions by N / 2*Sqrt(2)

More threads per core, and increase in virtualization
• Multiplies off-chip bus transactions by N

Processors / SMP increasing
• Aggravates queueing throughout the system
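
The linesize and thread-count points can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names, the 16-byte bus width, and the miss rate are assumptions, not figures from the slides. It also holds the miss rate fixed while the line size changes, which a real cache would not do, and it models only the "by N" factor for threads per core.

```python
import math


def bus_occupancy_cycles(line_bytes: int, bus_bytes_per_cycle: int) -> int:
    """Bus cycles needed to move one cache line across the bus."""
    return math.ceil(line_bytes / bus_bytes_per_cycle)


def bus_utilization(misses_per_cycle: float, line_bytes: int,
                    bus_bytes_per_cycle: int, threads: int = 1) -> float:
    """Fraction of bus cycles spent on line refills.

    Assumes each thread generates misses independently at the same rate,
    so N threads multiply the transaction rate by N. A result above 1.0
    means the bus cannot keep up with the offered load.
    """
    per_miss = bus_occupancy_cycles(line_bytes, bus_bytes_per_cycle)
    return threads * misses_per_cycle * per_miss


if __name__ == "__main__":
    # Hypothetical 16-byte-wide bus and 0.005 misses per cycle per thread.
    for line_bytes in (64, 128, 256):
        occupancy = bus_occupancy_cycles(line_bytes, 16)
        load = bus_utilization(0.005, line_bytes, 16, threads=4)
        print(line_bytes, occupancy, load)
```

Under these assumptions, each doubling of the line size doubles the cycles a single refill holds the bus, and the thread count scales the total load linearly.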


Without  more  bandwidth  at  low  latencies,  core  counts 


3D  extends  transfer  of  performance  from  the  device  to  the  core  level 


Design rule pitch (µm)


DARPA  MTS 


March  6,  2007 


©  2006  IBM  Corporation 


POWER Series Architectural Perf Contributions

[Figure: SpecFP (thousands) for POWER3, POWER4, POWER4+, POWER5, POWER5+. Contributions labeled: Symmetric Multi-threading, Multi-core Processor, Deep Pipelining, Out-of-Order Execution, Register Renaming, 64 Bit Dataflow. Annotations: Processor Perf, Data Delivery, Transaction Rate Dependence.]
From ISCA ’06

Components of Processor Performance


[Figure: CPI versus Miss Rate. The intercept is the Inf. Cache Perf (E Busy plus E Not Busy); the line has Slope = Miss Penalty (Cycles Per Miss); the rise above the intercept is the Finite Cache Effect.]

MIPS = f / CPI


Delay  is  sequentially  determined  by  a)  ideal  processor, 
b)  access  to  local  cache,  and  c)  refill  of  cache 
From ISCA ’06 keynote address by Phil Emma, IBM
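
The straight-line relationship on this slide can be written out as a short calculation. This is a minimal sketch, assuming the miss rate is expressed in misses per instruction and f in MHz; the function names and the sample numbers are illustrative and do not come from the keynote.

```python
def cpi(cpi_infinite_cache: float, misses_per_instruction: float,
        miss_penalty_cycles: float) -> float:
    """CPI as a line in miss rate.

    The intercept is the infinite-cache CPI; the slope is the miss
    penalty in cycles per miss; the second term is the finite cache effect.
    """
    finite_cache_effect = misses_per_instruction * miss_penalty_cycles
    return cpi_infinite_cache + finite_cache_effect


def mips(frequency_mhz: float, cpi_value: float) -> float:
    """MIPS = f / CPI, with f given in MHz."""
    return frequency_mhz / cpi_value


if __name__ == "__main__":
    # Hypothetical machine: infinite-cache CPI of 1.0, 100-cycle miss
    # penalty, 3000 MHz clock, swept over a few miss rates.
    for miss_rate in (0.0, 0.01, 0.02, 0.04):
        value = cpi(1.0, miss_rate, 100.0)
        print(miss_rate, value, mips(3000.0, value))
```

With these sample numbers, the finite cache effect already outweighs the infinite-cache CPI once the miss rate passes one miss per hundred instructions.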

Queueing Effects vs. Log Miss Rate

[Figure: Relative Performance and Bus Utilization versus Log2(Instructions per Miss)]


From  ISCA  ’06 
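
One way to reproduce the shape of such a plot is a simple queueing sketch. Everything here is an assumption rather than the keynote's model: the bus is treated as an M/M/1 server, so the mean wait is occupancy × ρ / (1 − ρ), and the names `queueing_point`, `latency_cycles`, and `occupancy_cycles` are hypothetical. Because utilization ρ depends on CPI and CPI depends on the wait, the two are solved together; with A = CPI_inf + m·L and b = m·T, the self-consistent CPI is the larger root of (CPI − A)(CPI − b) = b².

```python
import math


def queueing_point(cpi_infinite: float, instructions_per_miss: float,
                   latency_cycles: float, occupancy_cycles: float):
    """Return (cpi, bus_utilization, relative_performance).

    Assumes an M/M/1 bus: each miss waits occupancy * rho / (1 - rho)
    cycles on top of the unloaded latency, where rho is bus utilization.
    """
    m = 1.0 / instructions_per_miss            # misses per instruction
    a = cpi_infinite + m * latency_cycles      # CPI with an uncontended bus
    b = m * occupancy_cycles                   # bus-busy cycles per instruction
    # Larger root of (cpi - a) * (cpi - b) = b**2, which keeps rho < 1.
    cpi = 0.5 * ((a + b) + math.sqrt((a - b) ** 2 + 4.0 * b * b))
    rho = b / cpi
    return cpi, rho, cpi_infinite / cpi


if __name__ == "__main__":
    # Sweep log2(instructions per miss) for a hypothetical machine.
    for log2_ipm in range(4, 13, 2):
        cpi, rho, rel = queueing_point(1.0, 2.0 ** log2_ipm, 100.0, 8.0)
        print(log2_ipm, round(cpi, 3), round(rho, 4), round(rel, 3))
```

Relative performance here is measured against the infinite-cache machine. A different queueing discipline would change the curve's exact shape but not the trend that utilization falls and performance rises as instructions per miss grow.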
What Is Bandwidth Used For?

In a computer, it is mostly for handling cache misses:

[Figure labels: Miss, Access, Processor]

Miss Penalty = Leading Edge + Effects(Trailing Edge)
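
The miss-penalty decomposition can be sketched numerically. The slide does not define Effects(...), so this example makes loud assumptions: the leading edge is taken as the cycles until the first beat of the line arrives, the trailing edge as the remaining beats, and Effects as a simple linear factor (`trailing_exposure`, a hypothetical parameter) giving the fraction of the trailing edge that actually stalls the processor.

```python
import math


def miss_penalty_cycles(leading_edge_cycles: float, line_bytes: int,
                        bus_bytes_per_cycle: int,
                        trailing_exposure: float = 0.5) -> float:
    """Miss penalty = leading edge + exposed part of the trailing edge.

    trailing_exposure is an assumed stand-in for Effects(...):
    0.0 means the trailing edge is fully hidden, 1.0 fully exposed.
    """
    beats = math.ceil(line_bytes / bus_bytes_per_cycle)
    trailing_edge_cycles = beats - 1  # beats that follow the first one
    return leading_edge_cycles + trailing_exposure * trailing_edge_cycles


if __name__ == "__main__":
    # Hypothetical 100-cycle leading edge on a 16-byte-wide bus.
    for line_bytes in (64, 128, 256):
        print(line_bytes, miss_penalty_cycles(100.0, line_bytes, 16))
```

Under this sketch, more bandwidth shortens only the trailing-edge term, while lower latency shortens the leading edge.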


3D - Bandwidth and Latency

From “New Dimensions in Microarchitecture”
K. Bernstein, MICRO-39, 12/06

Processor load trade-off between I/O Bandwidth, Bus Latency.
- For generic workloads, uni-processor perf saturates bandwidth benefit, becomes latency-limited.
- As core counts increase, I/O Bandwidth becomes increasingly important

[Figure: Bandwidth and Latency Boundaries. X-axis: Bandwidth, GB/Sec (Norm). Series: Single Core, Double Core, Quad Core.]

3D opportunity for improving High Perf Compute throughput in sustaining a higher number of cores per chip
Low Vdd Technology and Parallelism

Energy optimum for fixed performance as a function of Vdd, VT, and effectiveness of parallelism

a determines the device (circuit) overhead to maintain constant performance through parallelism
• a = 1, no overhead: half the speed, double the devices
• a > 1, increasing overhead: passive power becomes dominant

[Figure: Energy per Transaction at Constant Performance]

Two Classes of 3DI Processes at IBM

Advantage: Smallest 3D vias

Bulk top layer
Advantage: Broader foundry compatibility
Summary

µP architecture tricks to avoid atomistic, QM scaling boundaries overwhelm present interconnect technologies.

Integration into the Z-plane again postpones interconnect-related limitations to extending classic scaling.

No aspect of architecture or technology remains 2D, so why even view chips as being monolithic anymore?

Transaction retirement rate dependence on data delivery is increasing; dependence on µP performance and CMOS device speed is decreasing.

3D Integration improves storage density and access to that storage.

The last remaining opportunity in CMOS to save power is in delivery of data rather than in its generation.